Automatic Performance Setting for Dynamic Voltage Scaling

KRISZTIAN FLAUTNER, STEVE REINHARDT and TREVOR MUDGE
Advanced Computer Architecture Lab, The University of Michigan, 1301 Beal Ave., Ann Arbor, MI 48109, USA



Abstract The emphasis on processors that are both low power and high performance has resulted in the incorporation of dynamic voltage 
scaling into processor designs. This feature allows one to make fine granularity tradeoffs between power use and performance, provided 
there is a mechanism in the OS to control that tradeoff. In this paper, we describe a novel software approach to automatically controlling 
dynamic voltage scaling in order to optimize energy use. Our mechanism is implemented in the Linux kernel and requires no modification 
of user programs. Unlike previous automated approaches, our method works equally well with irregular and multiprogrammed workloads. 
Moreover, it has the ability to ensure that the quality of interactive performance is within user specified parameters. Our experiments show 
that as a result of our algorithm, processor energy savings of as much as 75% can be achieved with only a minimal impact on the user 
experience. 

Keywords: dynamic voltage scaling, power management, performance-setting, interactive performance, response time
1. Introduction

The performance of microprocessors has been improving at
an exponential rate and this trend is likely to continue for sev- 
eral years to come. However, increased performance does 
not come for free. One of the most important consequences 
of higher performance has been a dramatic increase in power 
consumption. While an Intel 386 processor consumes about
2 watts of power, a Pentium 4 can use more than 55 watts. In
a mobile environment, batteries have not kept pace with the
increased energy requirements, which means that either application
performance or battery time suffers. However, even
in environments where energy storage is not an issue, energy 
cost and heat management may become problems [10] . 

There is still a need to continue to improve processor per- 
formance, since not all applications are "fast enough", but an 
increasing number are. A way to bridge the gap between 
high performance and low power is to allow the processor 
to run at different performance levels depending on the ap- 
plication's requirements. Some processors, such as the Intel 
XScale [2] and Transmeta Crusoe [8] allow the frequency of 
the processor to be reduced with proportional reduction in 
voltage. Slowing down frequency without voltage scaling is 
not useful, since the power savings is offset by an equal increase
in execution time, yielding no reduction in the total
amount of energy consumed. However, since energy is pro- 
portional to the square of the voltage, reducing the operating 
voltage can yield significant energy savings [14]. The central 
issue with processors whose performance can be changed is 
how the right level of performance can be obtained. The goal 
is to reduce the performance of the processor without causing 
an application to miss its deadlines (see figure 1). Completing
a task before its deadline and then idling is less energy
efficient than running the task more slowly to begin with, and 
meeting its deadline exactly. 
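The argument above can be checked with a small calculation. The following is a minimal sketch, assuming the common dynamic-energy model E = C * V^2 * cycles, a voltage that scales linearly with frequency, and no idle or leakage power. The function names, the capacitance constant and the example numbers are illustrative assumptions, not measurements from this paper.

```python
def dynamic_energy(cycles, voltage, capacitance=1.0):
    """Dynamic switching energy under the model E = C * V^2 * cycles."""
    return capacitance * voltage ** 2 * cycles


def compare_runs(cycles, deadline_s, f_max_hz, v_max):
    """Compare a full-speed run with a run stretched to its deadline.

    Assumes voltage can be lowered in proportion to frequency,
    which is a simplification of real processors.
    """
    full_speed = dynamic_energy(cycles, v_max)

    # Lowest frequency that still completes the work by the deadline.
    f_needed = min(f_max_hz, cycles / deadline_s)
    scale = f_needed / f_max_hz
    stretched = dynamic_energy(cycles, v_max * scale)

    # Lowering frequency alone keeps V and the cycle count unchanged,
    # so the energy for the task is unchanged as well.
    frequency_only = dynamic_energy(cycles, v_max)
    return full_speed, stretched, frequency_only


# Hypothetical task: 50 million cycles, 100 ms deadline, 1 GHz peak, 1.5 V.
full, stretched, freq_only = compare_runs(
    cycles=50e6, deadline_s=0.1, f_max_hz=1e9, v_max=1.5)
print("stretched / full-speed energy:", stretched / full)
print("frequency-only / full-speed energy:", freq_only / full)
```

Under these assumptions, halving the speed to meet the deadline exactly quarters the energy, whereas reducing frequency alone saves nothing.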

Our aim is to design an algorithm that balances energy sav- 
ings with the following requirements: 




Figure 1. Performance scaling. This figure shows two different runs of the
same workload. In A, the workload runs at full speed and finishes well in
advance of its deadline. In B, the execution of the workload is stretched to
its deadline, which allows for energy savings on processors that implement
voltage scaling.

• No modification of user programs. 

• Works with irregular and multiprogrammed workloads. 
• Ensures that user-perceived performance does not suffer.

Previous interval-based approaches to automated performance
setting did not fully achieve the goals outlined in the last 
two points. These approaches focus on the ratio of idle- 
to busy-time as the indicator of the right performance set- 
ting [5,6,13,16]. While the results looked promising for reg- 
ular workloads (such as audio playback, where processor uti- 
lization is periodic), the proposed schemes do not work well 
for interactive or irregular applications. 

The aforementioned papers point out that looking at idle 
time alone as the indicator of the right performance level is 
not sufficient. In their future work section, Weiser et al. pro- 
pose an alternative approach, where jobs are classified into 
background, periodic and foreground classes [16]. They suggest
that the added semantic information could be used to improve
the scheduling algorithms. Govil et al., in their future
work section, propose a similar solution, where process type
along with information specified by the processes (e.g., deadline)
could be used for performance setting [5]. Our approach
follows along the lines of these earlier works; however, we 
derive deadline and classification information automatically 
from the OS kernel, by examining the communication pat- 
terns between the executing tasks (section 3). This informa- 
tion is used to isolate execution episodes corresponding to 
different communication patterns. We can classify execution 
episodes into one of the following categories: interactive, pe- 
riodic producer, and periodic consumer. These classifications 
can be used to derive deadlines for the execution episodes. 
For example, for an interactive episode, the deadline is the 
perception threshold, which we assume to be between 50 ms 
and 100 ms. The deadline for a producer is the point at which 
the consumer actually needs the produced data. Episode clas- 
sification guides performance-setting decisions on a per-task 
and per-episode basis and is also used to dynamically evaluate
the impact of past decisions on user-perceived performance 
(section 4). 

We focus primarily on interactive applications, since we 
believe that this is one of the most difficult, but also the 
most important class of applications for performance scaling.
We also consider the effects of a concurrently running
background application (an MP3 player) and a variety of as- 
sumptions about the power and performance models on the 
performance-setting strategy (section 6). 

2. Previous work 

In the context of real-time systems, researchers have explored 
voltage and frequency scaling as a means of reducing power 
consumption. Papers [7,11,15] present algorithms and theo- 
retical models that allow one to incorporate voltage schedul- 
ing into real-time schedulers. However, these papers are 
not directly applicable to general-purpose operating systems, 
since the workloads are expected to have well defined charac- 
teristics (periodicity, resource requirements, deadlines, etc.). 
Moreover, the user must explicitly specify these characteris- 
tics to the scheduler. 

Our research is more closely related to the work described 
in [5,14,16], where performance-setting decisions are made
automatically, guided by the ratio of idle time to busy time
on the processor. A weakness of the existing approaches
is that they are not very accurate and can be easily confused
by irregular processor utilization. We improve on these algorithms
by using task-level information from the OS kernel.
A more detailed summary of the directly related work follows 
below. 

The main ideas behind automatic performance setting 
were sketched out by Weiser et al. [16]. Their mechanism
uses the amount of idle time as the guide for finding the op- 
timum level of performance. The practical policy proposed 
in this paper is called PAST. In this policy the utilization for 
the most recent interval is computed and, if it is above a certain
threshold, performance is increased, but if the interval
includes mostly idle time, performance is reduced. 
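The shape of such an interval-based policy can be sketched as follows. This is an illustration of the idea only, not the algorithm as published in [16]: the thresholds, the step size and the minimum performance factor are assumptions chosen for the example, and the utilization values are taken as given rather than rescaled by the current speed.

```python
def past_step(perf, utilization, high=0.7, low=0.5, step=0.2,
              perf_min=0.25, perf_max=1.0):
    """One decision of a PAST-style interval policy.

    perf        -- current performance factor (fraction of peak speed)
    utilization -- busy fraction of the most recent interval, 0.0..1.0
    All thresholds and the step size are illustrative assumptions.
    """
    if utilization > high:
        perf = min(perf_max, perf + step)   # mostly busy: speed up
    elif utilization < low:
        perf = max(perf_min, perf - step)   # mostly idle: slow down
    return perf


def run_policy(utilizations, perf=1.0):
    """Apply the policy to a sequence of per-interval utilizations."""
    history = []
    for u in utilizations:
        perf = past_step(perf, u)
        history.append(round(perf, 2))
    return history


# An irregular workload: idle intervals interleaved with short bursts.
print(run_policy([0.1, 0.1, 0.9, 0.1, 0.9, 0.6]))
```

Because each decision depends on a single interval, alternating idle and busy intervals make the chosen setting swing back and forth rather than settle.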






While the PAST algorithm looks very attractive due to 
its simplicity and effectiveness on some benchmarks, Govil 
et al. [5] point out its shortcomings and propose improvements.
One of their complaints is that PAST looks back only
at a single interval and thus it smooths speed poorly: the algo- 
rithm keeps on changing performance levels without coming 
to a steady state, missing out on opportunities to save power. 
To remedy the situation, the authors propose a number of al- 
gorithms that use varying amounts of prediction to improve 
their accuracy. They conclude that smoothing, rather than so- 
phisticated prediction might be the most effective. Accord- 
ingly, they propose an algorithm, PEAK, that looks for recur- 
ring patterns of processor utilization, with special attention to 
short bursts of high utilization, and attempts to set processor 
speed accordingly. 

Pering et al. evaluate interval-based voltage scaling algorithms
for use on a handheld device [13]. One of the key contributions
of this paper is the use of the clipped-delay metric,
which takes into account that the length of some events can
be increased without affecting the user (see section 4.1.1). In
practice, the effectiveness of this technique depends on the 
allowed increase in delay. The algorithm may degrade perfor- 
mance slightly but yield significant power reduction. While 
the algorithms used in this paper worked, their performance 
fell short of optimum. Moreover, the algorithms used some 
specific knowledge about the executing programs, which may 
be impractical on a production system. Pering et al. show that
while interval-based voltage scaling algorithms work well on 
benchmarks with regular processor utilization (such as audio 
playback), they do not fare well on irregular workloads, such 
as interactive workloads or video playback. 

These conclusions are corroborated by Grunwald et al. [6]. 
They also find that using a weighted average of processor
utilization as a guide to future utilization (the AVGn policy)
does not yield the clock speed that would maximize processor
utilization. Another problem with this algorithm is that
the requirement to average N intervals introduces a delay in 
responding to processor demand. The authors find that exist- 
ing heuristics did not fare as well on an actual implementation 
as previous studies had suggested. 
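The response delay introduced by averaging can be illustrated with a short sketch. It assumes a recursive weighted average of the form W_t = (N * W_(t-1) + U_t) / (N + 1); this form, the function names and the 70% target are assumptions made for illustration and may differ from the exact policy evaluated in [6].

```python
def avg_n(utilizations, n):
    """Recursive weighted average: W_t = (n * W_{t-1} + U_t) / (n + 1)."""
    w = 0.0
    averages = []
    for u in utilizations:
        w = (n * w + u) / (n + 1)
        averages.append(w)
    return averages


def intervals_to_reach(target, n, limit=1000):
    """Number of fully busy intervals before the average reaches target.

    Starts from an idle history (average of zero) and feeds in
    intervals with 100% utilization.
    """
    for i, w in enumerate(avg_n([1.0] * limit, n), start=1):
        if w >= target:
            return i
    return None


for n in (0, 3, 9):
    print(n, intervals_to_reach(0.7, n))
```

With n = 0 the average follows the last interval immediately; larger values of n smooth the estimate but take more intervals to register a sudden burst of demand.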

Lorch et al. improve the performance of interval-based 
scaling algorithms by taking a task's work requirement prob- 
ability distribution into account [9]. This information is used 
to gradually increase the speed of the processor during exe- 
cution, yielding an improvement in energy savings. A similar 
insight led us to the strategy for setting the performance level 
for interactive episodes and for deriving the PanicThreshold 
(see equation (4)). 



3. Episode detection 

We first applied our interactive-episode detection algorithm 
to measure the effects of multiprocessing on interactive per- 
formance [3]. In this section we summarize this methodol- 
ogy and also show how it can be extended to find periodic 
episodes. 
The first step in determining the optimum level of performance
is to distinguish the important parts of the execution
from unimportant periods. We use the communication
characteristics of the executing applications as the basis of 
this classification. Episodes are triggered by communication 
events with specific tasks but multiple tasks may be involved 
during the execution of an episode. For example, an interactive
episode involving Ghostview is triggered by a message
from the X server to Ghostview. In return, events are
processed by the application and it may send messages to 
the X server, the window manager and its rendering engine 
(Ghostscript). All of these processes are part of the episode. 

There are two principal groups of episodes: periodic and
interactive. Periodic episodes may be further categorized into
producer and consumer, where the communication between 
these episodes establishes their performance level. All other 
processor activity is classified as background activity. It is 
important to note that during its lifetime a task can fall into 
more than one of these classifications. For example, a music 
playback process may be part of an interactive episode when 
it is updating the GUI and be a producer when it is decoding 
music data. 

We monitor which tasks communicate with a few well- 
known system tasks (such as the X server and the sound dae- 
mon). These tasks are then monitored for communication 
through specific system calls that are then used to classify 
them into one of the above categories. In addition, we collect 
run-time statistics about processor utilization. Thus, instead 
of relying on the programmer, we extract the necessary infor- 
mation from the system automatically, using simple changes 
to the OS kernel. 

As a consequence of our approach, the only idle time that 
shows up within an episode is due to device or communication 
latencies (hard idle-time) and cannot be removed by perfor- 
mance changes of the processor. Soft idle-time, on the other 
hand, occurs between episodes and is mostly due to latencies 
inherent in user interactions. This type of idle time can be 
reduced by slowing down the execution of episodes. 

3.1. Examples

Perhaps the easiest way to understand what episode detection 
accomplishes is to take a look at figure 2. This figure shows 
three execution traces, where the different types of episodes 
are highlighted in different colors and an abbreviation of the 
episode type is shown next to a few key episodes (IE - inter- 
active episode, PE - producer episode, and CE - consumer 
episode). The episode classification is exactly the same as 
it would be during run-time; no postprocessing takes place
(based on knowledge of the future) to derive the exact begin 
and end timestamps of the episodes. In the traces, a vertical 
bar of unit width represents one millisecond of execution. The 
vertical length of the line corresponds to the utilization of the 
processor in that quantum. Each line is colored according to 
the type of episode during the execution quantum (black, if it 
is not part of any specific episode). In some cases, especially 
in trace B, a single vertical line may be made up of multiple 
Figure 2. Episodes during execution. The figure shows three execution
traces, where the different types of episodes have been highlighted. The
first two traces are from the Acrobat Reader benchmark and include an MP3
player running in the background; the third trace is from Netscape accessing
web servers on the internet (no MP3 playback in background). IE stands for
interactive episode, PE for producer episode, and CE for consumer episode.

episodes, in which case the colors are proportional to the ex- 
ecution lengths of each type of episode during the quantum. 

Trace A is representative of execution during interactive
applications. It includes two significant interactive episodes 
along with producer and consumer episodes that are triggered 
by the MP3 player that is executing on the machine. The first 
thing to note is that the detection mechanism is not confused 
by overlapping episodes. The MP3 player kicks in twice (pro- 
ducer episodes) during the first interactive episode, and the 
classification mechanism accurately attributes the consumed 
processor time to it. The sound daemon (consumer episodes) 
wakes up once every 20-23 ms and runs for about a third of 
a millisecond. During this time it checks whether there is 
enough data in the sound card and triggers the producer to 
decode more data when necessary. This behaviour explains 
why the producer episode is always preceded immediately by 
a consumer episode. The checking is accomplished by the 
sound daemon periodically polling the sound device for data 
requests using the select system call. Note that the only
code that gets executed during these short periods is the code
corresponding to the select system call in the kernel, which
checks for activity on the monitored devices. The periodicity 
of these consumer episodes is determined by how often the
kernel schedules processes that are blocked on the poll or
select system calls while waiting for activity. The peri- 
odicity can be configured by the caller of the system call but 
programmers often just rely on the default value in the ker- 
nel. When the sound device runs out of data, the select 
call returns - indicating the need to generate more sound data 
- which in turn wakes up the MP3 player that is blocked in
the write system call between requests. 

This data also illustrates that even a lightweight periodic 
process can have a significant impact on the user-perceived re- 
sponse time. During the run of the interactive episode, which 
lasts for 218 ms, the MP3 player and the sound daemon use up
about 16 ms of execution time, causing about an 8% increase
of the response time. 

Trace B shows the bursty activity resulting from the user 
moving a mouse over the screen, where a document in Acrobat
Reader is displayed. In this trace, one sees many short
interactive episodes instead of the long ones in trace A. Just 
as before, the periodic producer episodes are running; however,
it is more difficult to visually distinguish the consumer
episodes from the interactive episodes. There is only one long 
interactive episode (at around 300 ms), but there are interac- 
tive episodes in almost every quantum that has a utilization of 
about a quarter or more. Short interactive episodes are spaced 
at about 10 ms apart. The distance between these episodes 
is determined by the X server, which controls the quality of 
the user experience. Our X server, when it observes rapid 
mouse movement, increases screen updates to improve the 
user experience. The interactive episodes are short because 
the computation required to update the position of the mouse 
and to redraw affected regions of the screen is very simple.
The only heavier-weight interactive episode in this trace lasts
for about 3.5 ms and is a result of a change in the appearance of
the cursor when it passes over a special region of the Acrobat 
Reader application. 

Trace C illustrates a long interactive episode with a high 
percentage of idle time from a run of the Netscape bench- 
mark. The interactive episode starts at around 35 ms and ends 
at 468 ms, and it corresponds to a user loading a web page 
from a server on the internet. The idle time is due to I/O la- 
tency while the page is loaded from a remote server. Initially 
there are only small bursts of activity, mostly dealing with 
progress updates but as the requested data starts coming in 
(at around 280 ms), the rendering engine kicks in and starts 
generating output to the screen. Our episode detection mech- 
anism accurately attributes the entire episode as an interactive 
episode instead of breaking it into smaller disjoint parts. This 
example also illustrates a shortcoming of our scheme, which 
we believe is a fundamental problem with kernel level episode 
detection: for a user, it might be sufficient to wait until the
first screen of data is rendered instead of waiting for the en- 
tire web page to be ready. However, without modifying the 
web browser, there is no way of knowing when the window 
update is done. One could add further hooks into the web 
browser to accurately signal if the user is really interested in 
data at the bottom of the page. 



3.2. Implementation 

Our episode detection mechanism was purposely designed to
be as autonomous from other parts of the kernel as possible. 
Incorporating these techniques into an existing kernel requires 
only a small number of hooks, but most importantly it does 
not require changes to existing scheduling algorithms and 
policies. This contrasts with the approach taken by other researchers
that treat the performance-setting problem as a twist
on existing scheduling algorithms [7,11,15]. These schemes 
usually require perfect knowledge about episode deadlines, 
which need to be specified either by the programmer or the 
user of the system. 

Most applications under UNIX communicate using sockets,
signals, and pipes. In particular, the X server uses sockets
to communicate with its clients. We do not track interactions 
via other methods such as System V IPC and shared mem- 
ory since our benchmarks do not use them. By tracking the 
communications between the tasks, we are able to determine 
which tasks have an effect on interactive performance. Unlike
other operating systems (e.g., Windows NT), Linux does
not differentiate between threads and processes. Threads are 
implemented using regular processes and the clone system 
call. We use the name "task" as a synonym for both threads
and processes. The implementation that performs the track- 
ing is as non-invasive as possible. The difficulty was not in 
the actual implementation but in finding all the parts of the 
kernel that needed to be tracked. Currently we track commu- 
nications through the following system calls: 

kill, pread, pwrite, read, readv, recv,
recvfrom, recvmsg, send, sendmsg, sendto,
write, writev.

We instrumented each of these system calls to emit a trace
of the signals, inodes, and sockets that they are accessing. 
The socket information is output instead of the inode number, 
when a socket is accessed through an inode. To be able to 
match read and write requests through socket pairs, we use 
the socket's pair (sock->sk->pair) on a write and the
read socket itself on a read event. Currently we track only 
communications through UNIX sockets since this is the only 
socket type that is local to the machine. One could extend this 
methodology to track communications through other types of 
sockets if the communicating programs are all local to the 
machine. However, we have seen no need for this extension 
so far. 

The primary reason for tracking signals is that the thread 
library (LinuxThreads) uses signals to implement synchronization
between threads. By looking at the signal activity
we can determine how threads communicate through condition
variables, mutexes, and locks. The two functions
that needed to be instrumented are handle_signal and
send_sig_info. An alternative to this approach would 
have been to instrument the thread library; however, our cur- 
rent approach is more generic and has lower overhead. 

To determine when tasks are blocked on I/O, we instru- 
mented the schedule function to record the reason why it 
was called. If it is called from a part of the kernel that is
related to I/O (such as the read and write system calls), 
then we assume that the task is blocked while waiting for an 
I/O event to complete. Since there is no predefined way in 
Linux to find which system call caused a transition to the ker- 
nel, we instrumented key system calls to put their id in a field 
of the executing task's task_struct. Once execution gets 
to the schedule function, our code looks at this field and 
outputs the task's reason for giving up time. Our approach
uses a mix of static and dynamic kernel patches. As men- 
tioned above, we have augmented a few kernel data structures 
and added "hooks" to some kernel functions. However, most 
of the patching is done dynamically by replacing vectors in 
the system call table with stubs that monitor the application 
behaviour. 

3.3. Interactive episodes 

The beginning of an interactive episode is initiated by the user 
and is usually signified by a GUI event, such as pressing a 
mouse button or a key on the keyboard. Finding the end of 
an episode is more difficult since there is no event that auto- 
matically gets generated when the computer is done respond- 
ing. To find interactive episodes, we keep track of the set of 
tasks that communicate with each other as a result of a user- 
initiated GUI event. The start of an interactive episode is ini- 
tiated by the GUI controller (X server in our case) sending a 
message through a socket to another task. When this happens 
both the GUI controller and the receiver of the message are added
to what we refer to as the task set of the episode. If the mem- 
bers of the task set communicate with non-member tasks, then 
the as yet non-member tasks are also added. The end of the 
episode is reached when all the following conditions are met 
for tasks in the task set: 

• None of the tasks are executing. 

• Data written by the tasks have been consumed. 

• None of the tasks remains unfinished, as a result of being
preempted the last time it ran (i.e., all tasks gave up time
on their own by blocking in a system call).

• None of the tasks are blocked on device I/O. 
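The task-set bookkeeping and the four end conditions listed above can be sketched in a few lines. This is a simplified user-space model, not the kernel implementation: the class, its method names, the event vocabulary and the single counter of unconsumed messages are assumptions made for illustration.

```python
class InteractiveEpisode:
    """Tracks the task set of one interactive episode (illustrative model)."""

    def __init__(self, gui_task, receiver):
        # The GUI controller and the task it messaged seed the task set.
        self.tasks = {gui_task, receiver}
        self.running = set()         # tasks currently executing
        self.preempted = set()       # tasks that lost the CPU involuntarily
        self.blocked_on_io = set()   # tasks waiting on device I/O
        self.unconsumed = 1          # the triggering message is still unread

    def on_write(self, sender, receiver):
        # A member writing to another task pulls that task into the set.
        if sender in self.tasks:
            self.tasks.add(receiver)
            self.unconsumed += 1

    def on_read(self, reader):
        if reader in self.tasks and self.unconsumed > 0:
            self.unconsumed -= 1

    def on_run(self, task):
        self.running.add(task)
        self.preempted.discard(task)
        self.blocked_on_io.discard(task)

    def on_stop(self, task, reason):
        # reason is "preempted", "device_io", or anything else for a
        # voluntary block in a system call.
        self.running.discard(task)
        if reason == "preempted":
            self.preempted.add(task)
        elif reason == "device_io":
            self.blocked_on_io.add(task)

    def finished(self):
        members = self.tasks
        return (not (self.running & members)
                and self.unconsumed == 0
                and not (self.preempted & members)
                and not (self.blocked_on_io & members))


# Hypothetical event sequence for a Ghostview-like episode.
ep = InteractiveEpisode("xserver", "ghostview")
ep.on_run("ghostview")
ep.on_read("ghostview")
ep.on_write("ghostview", "ghostscript")
ep.on_stop("ghostview", "voluntary")
mid = ep.finished()              # data written but not yet consumed
ep.on_run("ghostscript")
ep.on_read("ghostscript")
ep.on_stop("ghostscript", "preempted")
after_preempt = ep.finished()    # a member was preempted mid-work
ep.on_run("ghostscript")
ep.on_stop("ghostscript", "voluntary")
done = ep.finished()
print(mid, after_preempt, done)
```

A real implementation would track unconsumed data per socket or pipe rather than with one counter; the single counter keeps the sketch short.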

Detecting interactive episodes is only the first step towards
performance prediction. Section 4.1 describes how the 
episode's deadline can be found. 

3.4. Periodic episodes 

Detecting periodic activity is similar to detecting interactive
episodes. However, instead of using communications with the 
X server as the trigger for starting the episode, we base this 
decision on whether the initiating task is periodic. To detect 
periodic activity, we keep track of two pieces of information for
each task: 

• Last execution time. 

• Length of the n last periods. 




Figure 3. Producer and consumer episodes. The figure shows communications
between a producer and a consumer process. The processor can be
slowed down to stretch the producer episode to the beginning of the consumer
episode.

If a task exhibits only a small amount of variation in period 
length over the last n runs (<5%), then we treat it as a periodic 
task. 
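A periodicity test along these lines might look as follows. This is a sketch under stated assumptions: the text does not define how variation is measured, so the code uses the spread (maximum minus minimum) of the last n periods relative to their mean, and the class name, the default n = 4 and the timestamps are hypothetical.

```python
from collections import deque


class PeriodTracker:
    """Per-task record of the last execution time and the last n periods."""

    def __init__(self, n=4, tolerance=0.05):
        self.last_start = None            # last execution time (ms)
        self.periods = deque(maxlen=n)    # lengths of the n last periods
        self.tolerance = tolerance        # allowed relative variation

    def on_run(self, now_ms):
        """Record that the task started running at time now_ms."""
        if self.last_start is not None:
            self.periods.append(now_ms - self.last_start)
        self.last_start = now_ms

    def is_periodic(self):
        # Wait until n full periods have been observed.
        if len(self.periods) < self.periods.maxlen:
            return False
        mean = sum(self.periods) / len(self.periods)
        if mean <= 0:
            return False
        spread = max(self.periods) - min(self.periods)
        return spread / mean < self.tolerance


regular = PeriodTracker()
for t in (0.0, 20.0, 40.5, 60.5, 80.5):
    regular.on_run(t)

irregular = PeriodTracker()
for t in (0.0, 20.0, 23.0, 60.0, 61.0):
    irregular.on_run(t)

print(regular.is_periodic(), irregular.is_periodic())
```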

3.4.1. Producer-consumer episodes

Producer-consumer episodes form a special subcategory of 
periodic episodes, where the optimum performance level is 
established by the distance from producer to consumer, not
by the distance between periods. A case in point is the Linux 
esd sound daemon, which wakes up periodically to check for 
sound playback requests and to send data to the sound card. If 
esd's playback buffer is not empty, it sends some of the data 
to the sound card. If the buffer is close to being empty, it 
wakes up and unblocks the music decoders (e.g., MP3 play- 
ers), which causes them to generate the next few frames of 
data. 

Figure 3 illustrates interaction between a producer and 
a consumer process. The distance between the run of the 
producer and where the data is needed establishes the per- 
formance level of the producer. The producer episodes can 
be stretched to the beginning of their associated consumer 
episodes. To determine how much the consumer can be
slowed down, we first need to determine what the consumer 
is doing. This information can either be specified by the user 
on a per-process basis, or one can compute it by observing 
which devices or processes the consumer communicates with. 
If the task is communicating with a device, the buffer sizes 
and the speed requirements of the device establish the minimum
speed of the consumer episode. In our current scheme
we assume that the consumer can be slowed down to the same 
speed as the producer. 
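Stretching a producer episode to the start of its consumer episode amounts to a simple ratio. The sketch below assumes that execution time scales inversely with processor speed and that the processor cannot go below a minimum performance factor of 0.25; the function name, its parameters and the example values are illustrative assumptions.

```python
def producer_performance_factor(full_speed_ms, window_ms, perf_min=0.25):
    """Slowest speed at which the producer still finishes in time.

    full_speed_ms -- producer execution time at peak performance
    window_ms     -- time from the start of the producer episode to the
                     point where the consumer needs the data
    Assumes execution time is inversely proportional to speed.
    """
    if window_ms <= 0:
        return 1.0                      # no slack at all: run at peak
    factor = full_speed_ms / window_ms
    return max(perf_min, min(1.0, factor))


print(producer_performance_factor(4.0, 10.0))   # fits with room to spare
print(producer_performance_factor(1.0, 20.0))   # limited by the minimum
print(producer_performance_factor(30.0, 20.0))  # cannot meet it: run at peak
```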



4. Performance prediction 

Our prediction mechanism operates on a per-task basis 
and uses different algorithms for interactive and periodic 
episodes. In both cases, the predictor computes the perfor- 
mance factor, which is the ratio of the desired execution speed 
and the processor's maximum speed. 

4.1. Interactive episodes

It is difficult to come up with a good prediction strategy for
the optimum performance level of interactive episodes, since 
interactive episodes are completely dependent on the user, not 
on some activity within the computer. There is no predictable 
pattern of recurrence and the lengths of interactive episodes 
Figure 4. Cumulative interactive episode length distributions. Left line shows the cumulative number, right line the cumulative percentage of time spent in
interactive episodes whose lengths are less than or equal to the time specified on the x axis. The x axis is drawn using a logarithmic scale. Vertical lines
from right to left: (a) 50 ms, (b) 12.5 ms and (c) 5 ms.



can have orders of magnitude of difference. Our detection 
scheme allows us to differentiate between different types of 
episodes (i.e., interactive, producer, consumer) but cannot dis- 
tinguish between different instances of the same episode in 
the same task (e.g., when the same button is pushed in the 
GUI as before). 

We believe that the ability to distinguish between inter- 
active episode instances would improve prediction accuracy. 
However, this would require the kernel to have knowledge 
about the location in the user program that initiated the given
interactive episode. While not impossible, distinguishing the 
real call-sites from the kernel is difficult to do. A simple comparison
based on the user-mode program counter (PC) is not
sufficient, since programs usually go through at least one level 
of indirection (through libc) when calling a system call, and
thus all instances of a program's calls to a given system call
would have the same user-mode PC. Moreover, since interac- 
tive episodes are usually generated as a result of GUI inter- 
action, the necessary number of indirection levels is probably 
higher due to the use of GUI libraries (e.g., gtk, Xlib). To find
the PC value that really distinguishes one interactive episode 
from another, one would have to chase pointers through multiple
levels, where the actual number of levels depends on the
environment (stack layout, libraries, etc.). 

Instead of basing a predictor on the ability to distinguish 
between interactive episode instances, we looked for a sim- 
pler solution that only relies on episode type (i.e., interactive, 
periodic producer or periodic consumer) for prediction. 

Figure 4 shows the cumulative distribution of interactive 
episode lengths for four interactive benchmarks. In each 
graph, there are two cumulative distributions: the one on the
left shows the cumulative number and the one on the right 
shows the cumulative time spent in interactive episodes of a 
given length or shorter. To account for the large variation
of interactive episode lengths, the time axis is logarithmic.
Three vertical lines (a, b, and c from right to left) delineate
the perception threshold (50 ms), the point under which all
episodes finish under the perception threshold at 1/4th of
peak performance (12.5 ms), and 1/10th peak performance
(5 ms). These values were selected because current processors
that are capable of performance and voltage scaling have
a minimum performance of about 1/4th peak performance,
and future processors could possibly extend the range of performances
to 1/10th of peak value.
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These graphs show that while most episodes are very short, 
the vast majority of lime is actually spent in a smal] frac- 
tion that correspond to the long episodes. For example, in 
Ghostview, 92% of the time is spent in 4% of the episodes. 
This distribution holds true on the Xemacs benchmark as 
well, however in this case even the relatively long episodes 
fall under the perception threshold. Xemacs is an example 
of an application where one could run almost all of its inter- 
active episodes in the lowest performance level without ever 
exceeding the perception threshold. 

The cumulative episode length distribution graphs imply 
that a predictor that predicts that an interactive episode only 
needs the minimum available processor performance would 
be right more than 90% of the time. However, since these 
episodes tend to make up only a small percentage of total 
time - and consequently, have a small contribution to energy 
use - it is more important to focus on accurately predicting 
the performance level of the relatively long episodes. 

Our performance-factor predictor for interactive episodes 
works by starting off with an initial performance factor, set to 
the minimum performance factor of the processor, and then 
by successively refining its value. Since the initial setting is 
only relevant for the first interactive episode, the choice of ini- 
tial value does not have a significant impact on response time 
or energy savings. The algorithm uses the following three 
steps: 

• Starts running the episode at the predicted performance 
factor. 

• At the end of the episode, computes the duration that
corresponds to executing at full performance, and uses this
information to compute the optimal performance factor for the
episode.

• Uses the weighted average of optimum performance fac- 
tors (PF) as a prediction for future performance factors. 

The main observation that we use in our predictor is that it is
straightforward to compute what the optimum performance level
should have been once an interactive episode is over. During the
execution of the episode, the performance setting of the processor
might be changed by external events (e.g., periodic episodes start
executing), so the algorithm must keep track of the observed
performance factors (pf_i) during the episode's execution. At the
end of the episode, this information can be used to estimate how
long the episode would have been at full performance:

    T_{FullSpeed} = \sum_{i=1}^{n} \bigl( pf_i \, (t_i - idle_i) + idle_i \bigr)    (1)

This equation computes the full-speed execution time for an
interactive episode given n different observed performance levels
during the episode. The variable t_i specifies the length of
execution at the ith performance level during an interactive
episode, and idle_i is the corresponding amount of idle time.

Based on the estimate of the episode execution time at full
performance, the optimum performance level can be estimated for an
interactive episode. Equation (2) gives the formula for computing
the optimum performance factor for episodes where T_FullSpeed
falls between the minimum-performance threshold and the
PerceptionThreshold, where T_idle specifies the amount of idle
time during the episode:

    PF_{optimum} = \frac{T_{FullSpeed} - T_{idle}}{PerceptionThreshold - T_{idle}}    (2)

The minimum-performance threshold specifies the episode duration
that could be slowed down to the processor's minimum performance
level and still finish under the perception threshold. If the
perception threshold is assumed to be 50 ms and the processor's
minimum performance is 1/4th of peak, then this value is 12.5 ms.
Episodes that are shorter than the minimum-performance threshold
can be run at the processor's minimum performance level. Episodes
that are longer than the perception threshold need to run at full
performance.

We predict the performance factor for the next interactive episode
of a given task simply as the average of the optimum performance
factors of past interactive episodes, weighted by the duration of
each episode:

    PF_{prediction} = \frac{\sum_{j=1}^{k} T_j \, pf_j}{\sum_{j=1}^{k} T_j}    (3)

Equation (3) shows the computation for the predicted performance
factor based on the optimum performance factors (pf_j) for k past
interactive episodes. T_j refers to the estimated full-speed time
of an interactive episode. The size of k can be varied to
eliminate saturation and to allow temporal variations of episode
lengths to affect the predictor (see section 4.3).
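
The computations in equations (1)-(3) can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not the kernel implementation: the function names, the
tuple layouts, and the default values (50 ms perception threshold,
minimum performance factor of 1/4) are chosen here for the example,
and all times are in milliseconds.

```python
def full_speed_time(segments):
    """Equation (1). segments: list of (pf_i, t_i, idle_i) tuples (assumed layout)."""
    return sum(pf * (t - idle) + idle for pf, t, idle in segments)


def optimum_pf(t_full, t_idle, perception_threshold=50.0, min_pf=0.25):
    """Equation (2), with the two boundary cases described in the text."""
    if t_full >= perception_threshold:
        return 1.0      # longer than the perception threshold: full performance
    if t_full <= min_pf * perception_threshold:
        return min_pf   # below the minimum-performance threshold
    pf = (t_full - t_idle) / (perception_threshold - t_idle)
    return min(1.0, max(min_pf, pf))


def predicted_pf(history):
    """Equation (3). history: list of (T_j, pf_j) for the k past episodes."""
    total_time = sum(t for t, _ in history)
    if total_time == 0:
        return None     # no history yet; the caller falls back to the initial PF
    return sum(t * pf for t, pf in history) / total_time
```

For instance, an episode that would take 25 ms at full speed with
no idle time gets an optimum performance factor of 0.5, since
running at half speed stretches it to exactly the 50 ms threshold.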

Since there can be orders of magnitude of difference be- 
tween the lengths of interactive episodes (see figure 4), this 
strategy means that the predicted performance factor for short 
episodes will almost certainly be higher than necessary. This 
effect is mitigated by the observation that short episodes have 
only a minimal impact on power consumption. 

To recover from prediction errors we set an episode-duration
threshold, after which, if the episode is still executing, the
performance level is raised to full speed. We refer to this
threshold as the PanicThreshold. While the PanicThreshold can
ensure that interactive performance does not degrade below a
certain level, the goal of the predictor is to set the right
performance level at the beginning of the episode, without the
need to transition to a higher performance setting later on.

The setting for the PanicThreshold reflects the user's tolerance
for worst-case performance degradation and determines how
speculative the performance factor predictor can be. If the user
has no tolerance for possible performance degradation, there is no
opportunity for speculation and consequently energy savings. In
this case one would be forced to be conservative and always run at
full performance to avoid misprediction errors that might extend
the episode beyond the perception threshold.

Equation (4) shows the computation for the PanicThreshold for a
given performance factor (PF) and perception threshold:

    PanicThreshold = PerceptionThreshold \cdot (1 + PF)    (4)

This formula allows a longer panic threshold when the initial
performance setting is high, because in those cases more work
actually gets done per unit time and therefore the cost of an
incorrect setting (in terms of its impact on the user) is lower.
We also make the assumption that the user allows more performance
degradation for episodes whose lengths are close to the perception
threshold than for longer episodes. In the worst case, if an
episode were to be exactly 50 ms at full speed, then its length
will be stretched to 97 ms given a performance factor of 1/4. On
the other hand, given the same performance factor, an episode that
would have been 200 ms at full speed would only be stretched to
247 ms.
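
The 97 ms and 247 ms figures can be checked with a short
calculation. The sketch below assumes an instantaneous switch to
full speed at the panic threshold and an episode with no idle time;
the function names are illustrative only.

```python
def panic_threshold(perception_threshold, pf):
    """Equation (4): PanicThreshold = PerceptionThreshold * (1 + PF)."""
    return perception_threshold * (1.0 + pf)


def stretched_length(full_speed_ms, pf, perception_threshold=50.0):
    """Episode length when started at pf and raised to full speed on panic."""
    panic = panic_threshold(perception_threshold, pf)
    if full_speed_ms / pf <= panic:
        return full_speed_ms / pf          # finishes before the panic threshold
    work_done = panic * pf                 # full-speed-equivalent work before the switch
    return panic + (full_speed_ms - work_done)


print(stretched_length(50.0, 0.25))   # 96.875, i.e. about 97 ms
print(stretched_length(200.0, 0.25))  # 246.875, i.e. about 247 ms
```

With PF = 1/4 the panic threshold is 62.5 ms, during which 15.625 ms
of full-speed work is completed; the remainder runs at full speed.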

4.1.1. The perception threshold

In this paper we use a range of perception thresholds during some
of our experiments. Our motivations for this are twofold:

• The higher perception thresholds allow us to estimate the
energy and interactive characteristics on a future,
higher-performance processor. The 100 ms threshold on today's
processor roughly corresponds to the 50 ms threshold on a
processor with twice the performance.

• The perception threshold varies by individual and task and may
be used as a user-settable indicator of the user's preference for
high performance or energy savings.

Literature about human-computer interaction [1,12] in- 
dicates that 20-30 frames per second are sufficient for the 
human visual system to perceive the images as a continu- 
ous stream. This suggests that the perception threshold is 
around 50 ms. Human subject tests in [1] show that percep- 
tual causality - when two events are perceived to be fused 
together - ends around 100 ms, and for some test subjects 
quality degradation begins at around 50 ms. 

Other experiments have shown that for simple operations, such as
dragging an object across the screen, as few as 5 updates per
second are sufficient to maintain an interactive feel (200 ms
perception threshold). For non-continuous operations, delays of as
much as 1-2 s are acceptable [12]. However, when human motor
operations form a feedback loop with visual activity, then it is
more important to have a faster response time.

4.2. Incorporating periodic episodes

The optimum performance factor for periodic activity can be
computed easily by either stretching the periodic episode's
execution to the beginning of the next episode or to the beginning
of the associated consumer episode. Since periodic applications
(such as video or sound playback) sometimes adjust the quality of
playback based on processor performance, it is important to switch
to full performance when a periodic application starts executing,
so that it has a chance to adapt to the highest performance level.
Our assumption is that the user's emphasis is on service quality
over energy savings. Others have addressed the tradeoff where
service quality can be reduced to save energy [4].

An important consideration is to find the performance fac- 
tor when interactive episodes are present in addition to the 
periodic activity. Our strategy is very simple: 

• When there is no interactive episode executing on the 
processor, we set the performance factor to the one com- 
puted for the periodic activity. 

• At the beginning of an interactive episode we switch to 
the performance factor that was predicted for the task's 
interactive episodes, if it is higher than the periodic per- 
formance factor. 
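
The two rules above amount to a small selection function. The
following sketch is an illustration only; the function name and the
convention of passing 0.0 when no periodic activity is present are
assumptions made for the example.

```python
def select_pf(periodic_pf, interactive_pf, interactive_active):
    """Pick the performance factor according to the two rules above.

    periodic_pf: factor computed for the periodic activity (0.0 if none, assumed).
    interactive_pf: factor predicted for the task's interactive episodes.
    interactive_active: True while an interactive episode is executing.
    """
    if not interactive_active:
        return periodic_pf
    # Only switch if the interactive prediction is higher than the periodic level.
    return max(periodic_pf, interactive_pf)
```

Taking the maximum ensures that an interactive episode never
lowers the performance level that the periodic task needs.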

Figure 5 illustrates this strategy during two runs of the Acrobat
Reader benchmark. When there is no periodic activity, performance
is determined only by the prediction for interactive episodes.
However, when periodic activity is present, the algorithm switches
between the two performance levels, causing significantly more
performance transitions. The spikes that transition the processor
to full performance are triggered by interactive episodes whose
lengths exceed the PanicThreshold. Aside from the initial start at
full performance when MP3 is executing in the background, there is
only one extra transition due to reaching the PanicThreshold in
the second graph.

[Figure 5: two panels, "No MP3" and "MP3 playback in background".]

Figure 5. Performance factor settings during the execution of the
Acrobat Reader benchmark. Two runs of the Acrobat Reader benchmark
are shown side by side, with and without MP3 playback during the
run. The perception threshold was set to 200 ms, and data was
generated using our simplest strategy (Basic predictor and XSB
model without quantization, see section 6.2). More sophisticated
models have significantly fewer performance-level transitions when
MP3 is executing in the background. Spikes to full performance
represent instances when the PanicThreshold was reached. Note that
the graphs do not show transitions to sleep mode.

Periodic episodes have the effect of extending the run times of
interactive episodes [3], which means that the interactive
performance-factor predictor should be updated when periodic
activity starts. Instead of making the prediction formula more
complicated, our approach allows the performance predictor to
quickly adapt to the presence of the periodic activity. The next
section describes exactly how this is done.

4.3. Implementation details of the Basic predictor

The main features of the Basic predictor are summarized be- 
low: 

• Interactive episode performance level prediction based on 
optimum performance factors of past episodes. Estimates 
of the optimal performance factors are computed at the end 
of each interactive episode. 

• PanicThreshold bounds worst-case performance. 

• Periodic performance level computed by observing peri- 
odic episodes and their communication patterns. 

• Switches between periodic and interactive performance 
factors depending on which episode is executing. 

An attractive feature of the predictor is that it requires very
little state. We use two variables per task: one keeps track of
the sum of episode-length-weighted performance factors (totalPF)
and the other keeps track of the total time spent in episodes
(totalTime). In both cases time is the estimated full-speed
execution time of the episode. The performance factor prediction
is thus totalPF divided by totalTime.

One problem with an averaging-based predictor is that if the
execution time is long, then temporal variation may not influence
the predictor for a very long time. One way of alleviating this
problem is periodically rescaling the variables by dividing them
both by the same amount. This way the predictor can better
accommodate a changing workload.

Performance prediction for interactive episodes in the presence of
periodic activity relies on rescaling to allow the predictor to
adapt to the changing workload. Our studies in [3] have shown that
even lightweight background activity, such as MP3 playback,
extends the duration of perceptible interactive episodes by an
average of 14%. This implies that performance factors predicted
based on data without the background activity would underestimate
the necessary performance. To alleviate this problem, when
periodic activity is detected, the totalTime variable is set to
100 ms and totalPF is recomputed based on the new value. While
providing a reasonable initial prediction, this change allows new
performance factor data to take hold quickly.
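
A minimal sketch of this two-variable state is shown below. It is
not the kernel code: the class and method names are hypothetical,
times are in milliseconds, and the reset on periodic activity is
interpreted here as keeping the current prediction while shrinking
its weight to 100 ms, which is one reading of "recomputed based on
the new value".

```python
class BasicPredictorState:
    """Per-task predictor state: totalPF and totalTime (names from the text)."""

    def __init__(self, initial_pf=0.25):
        self.total_pf = 0.0      # sum of T_j * pf_j over past episodes
        self.total_time = 0.0    # sum of T_j (estimated full-speed times)
        self.initial_pf = initial_pf

    def update(self, t_full, pf_opt):
        """Fold in one finished episode (full-speed time, optimum PF)."""
        self.total_pf += t_full * pf_opt
        self.total_time += t_full

    def predict(self):
        if self.total_time == 0.0:
            return self.initial_pf
        return self.total_pf / self.total_time

    def rescale(self, divisor):
        """Divide both variables by the same amount; the prediction is unchanged."""
        self.total_pf /= divisor
        self.total_time /= divisor

    def on_periodic_activity(self, new_total_time=100.0):
        """Keep the current prediction but give it only 100 ms of weight (assumed)."""
        pf = self.predict()
        self.total_time = new_total_time
        self.total_pf = pf * new_total_time
```

After the reset, a single 100 ms episode with an optimum factor of
1.0 moves a prediction of 0.5 to 0.75, whereas with 1000 ms of
accumulated history it would only move to about 0.55.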

5. Simulation methodology 

Our simulator is driven by traces collected using a modified Linux
kernel (2.3.99-pre3) running on a Dell Precision Workstation 410,
with only one of the two Pentium II 450 MHz processors enabled
(512 MB RAM). The software environment was Mandrake Linux 7 with
Helix Gnome 1.2. The traces used in this study are the same as the
uniprocessor traces used in [3]. All benchmarks were run by a live
user. While we have collected multiple runs in each configuration,
in this paper we only use a single run for each simulation. We
aimed to repeat each run with MP3 in the background as accurately
as possible, but there are slight variations between the runs. All
the significant events (e.g., mouse clicks, text entry) were
performed in the same order during each benchmark run. However,
the exact path of mouse movement (and therefore the interactive
episodes corresponding to them) and the amount of time between
events varies from one run to the other.

The traces include all significant OS events during the
benchmarks' execution: thread swap events, system calls, and
task information (e.g., name, pid, etc.). Based on this in- 
formation, our simulator can reconstruct the communication 
events between the tasks (which imply the synchronization 
points between them) and simulate the effects of performance 
scaling. The upside of our methodology is that we have the 
flexibility to investigate a wide set of architectural parameters. 
The downside is that the absence of actual hardware prevents 
us from measuring total energy consumption and from cali- 
brating our results. 



6. Energy and performance implications

Our aim is to develop a performance scaling technique that can
guarantee that user-perceived performance does not degrade below a
user-settable level. A detailed microarchitecture-level power
analysis is beyond the scope of this paper; however, we can derive
some estimates regarding the expected energy savings using a few
simple assumptions.

The metric we use is the energy factor, which is the ratio of the
energy used by the scaled workload to its predicted energy use at
full performance. Equation (5) gives the energy factor formula for
a given workload, assuming that the workload is divided into n
pieces that execute at the scaled voltage (v_i) and frequency
(f_i, specified in MHz) for the scaled amount of time (t_i, in s).
T refers to the total execution time at full speed, while the max
subscript refers to the maximum value of the given variable:

    EnergyFactor = \frac{\sum_{i=1}^{n} v_i^2 \, f_i \, t_i}{v_{max}^2 \, f_{max} \, T}    (5)

Our model focuses on the CPU alone and does not take the 
power consumption of other devices (such as memory and 
peripherals) into account. 
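
As an illustration, the energy factor can be computed as follows.
This sketch assumes that dynamic energy is proportional to the
square of the voltage times frequency times time, as in equation
(5); the function name and the tuple layout are chosen for the
example. The sample values use the 600 MHz / 1.2 V and 1000 MHz /
1.75 V pairs of the XSB model from Table 1.

```python
def energy_factor(pieces, v_max, f_max, t_full):
    """Equation (5).

    pieces: list of (v_i, f_i, t_i) tuples in volts, MHz and seconds (assumed layout).
    v_max, f_max: peak voltage and frequency; t_full: total time at full speed.
    """
    scaled = sum(v * v * f * t for v, f, t in pieces)
    return scaled / (v_max * v_max * f_max * t_full)


# One second of full-speed work (1000 MHz, 1.75 V) run at 600 MHz and 1.2 V
# takes 1000/600 s, so the energy factor reduces to (1.2 / 1.75) ** 2.
ef = energy_factor([(1.2, 600.0, 1000.0 / 600.0)], 1.75, 1000.0, 1.0)
print(round(ef, 3))  # 0.47
```

When the same number of cycles is executed at a lower frequency,
the frequency and time terms cancel and only the voltage ratio
squared remains, which is why lowering the voltage is what saves
energy.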

6.1. Processor and power scaling model

Our performance model is based on assumptions from the Intel
XScale microarchitecture [2]. Our traces were collected on a
450 MHz Pentium II based workstation, and we make the simplifying
assumption that these traces correspond to the full-speed
performance on each of the simulated models.

Table 1
Frequency-voltage pairs in our energy models. The table shows the
given frequency-voltage pairs of our models. Dashes represent
frequency levels that are not supported in a given model.

Model      Frequency (MHz)
           150    333    400    600    773    800    1000
XSBase     -      1      1.1    1.3    1.5    -      -
XSA        0.75   1      1.1    1.3    1.5    -      -
XSB        0.75   -      -      1.2    -      1.4    1.75

We assume that for each performance transition, there is a 20 µs
pause, during which the processor does not execute any
instructions. This pause is due to the time it takes to
resynchronize the PLLs for the changed frequency value. After
this, the performance transition time - during which the voltage
level is changed - is assumed to be 1 ms regardless of starting
and ending performance level. During this time, we assume that the
processor is executing instructions at the rate corresponding to
the lower of the two performance levels, but energy is being
consumed at the higher.

Table 1 shows the known frequency-voltage values that we used to
compute voltage equations (equations (6)-(8)) for arbitrary
frequency levels between the minimum and maximum frequencies. The
XSBase model corresponds closely to the high-end XScale part
(80200M733) described in [2]:

    v_{XSBase} = -5 \times 10^{-8} f^2 + 0.0012 f + 0.6261    (6)

However, since this model only has a 2.32x frequency range, we
extended it to 5.15x by allowing it to go as low as 150 MHz in the
XSA model:

    v_{XSA} = -4 \times 10^{-7} f^2 + 0.0015 f + 0.5324    (7)

The XSB model uses the parameters of a high-end device that is a
research prototype. This processor can vary its performance
between 150 MHz and 1000 MHz for a 6.67x swing:

    v_{XSB} = 5 \times 10^{-7} f^2 + 0.0005 f + 0.6624    (8)
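
The three voltage equations are simple quadratic fits and can be
evaluated directly. The sketch below uses the coefficients as
printed in equations (6)-(8), with f in MHz and the result in
volts; the dictionary and function names are illustrative. Because
these are fitted curves, the values only approximate the entries
in Table 1.

```python
# Coefficients (a, b, c) of v(f) = a*f^2 + b*f + c, taken from equations (6)-(8).
VOLTAGE_COEFFS = {
    "XSBase": (-5e-8, 0.0012, 0.6261),
    "XSA": (-4e-7, 0.0015, 0.5324),
    "XSB": (5e-7, 0.0005, 0.6624),
}


def voltage(model, f_mhz):
    """Interpolated supply voltage (V) for a model at frequency f_mhz (MHz)."""
    a, b, c = VOLTAGE_COEFFS[model]
    return a * f_mhz ** 2 + b * f_mhz + c


for model, f in (("XSBase", 333), ("XSA", 150), ("XSB", 150)):
    print(model, f, round(voltage(model, f), 2))
```

At the low end of each range the fits reproduce the tabulated
values closely, for example about 0.75 V at 150 MHz for both XSA
and XSB.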



We use the energy factor as our metric for computing energy
reduction. It is the ratio of a given workload's energy
consumption using our performance-setting strategy over its energy
consumption using the processor's peak performance. In all our
energy calculations, we assume that the OS power manager puts the
processor into a low-power sleep mode immediately when no
instructions are executing. We do not attribute a power cost to
this operation and assume that it happens instantaneously.

During our evaluation we specify a quantization factor for each of
the power models. On an actual processor, not all
frequency-voltage pairs can be set directly; one must choose from
a set of predetermined values. This means that when the
performance estimator requests a given performance setting, the
actual performance value is rounded up to the next quantum. In our
experiments we mostly focus on models where frequency is quantized
at 5% steps (100%, 95%, 90%, etc.). We denote quantized models
with the suffix 'q' followed by the quantum size.
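
Rounding a requested setting up to the next quantum can be sketched
as follows. Performance levels are expressed as percentages of peak
to avoid floating-point artifacts; the function name and the 15%
floor (150 MHz out of 1000 MHz, as in the XSB model) are
assumptions made for this example.

```python
import math


def quantize_percent(requested_percent, quantum_percent=5, min_percent=15):
    """Round a requested performance level (% of peak) up to the next quantum."""
    # The small epsilon keeps exact multiples (e.g. 90.0) from rounding up a step.
    steps = math.ceil(requested_percent / quantum_percent - 1e-9)
    return min(100, max(min_percent, steps * quantum_percent))


print(quantize_percent(62.3))  # 65
print(quantize_percent(90))    # 90
```

Rounding up rather than to the nearest level means the quantized
setting never provides less performance than the estimator
requested.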
6.2. Performance and energy characteristics

In this section we examine the characteristics of the basic per- 
formance setting algorithm and propose some improvements. 
Our evaluation focuses on the following three main goals: 

• Minimizing the number of performance-level transitions.

• Minimizing the amount of increase in the duration of
perceptible interactive episodes.

• Maximizing energy reduction.

These three aspects are closely interrelated. Reducing the number
of performance-level transitions is important, because each
transition has a delay and energy cost that negatively affects
both the perceptible performance and energy savings. On the other
hand, increasing the interactive episode duration has a positive
effect on energy savings, because the longer interactive episodes
may stretch, the slower they can run. However, this has a negative
impact on the user-perceived performance. While the increase in
perceptible interactive episode duration in all cases falls within
the acceptable range (since ensuring this is part of the
performance-setting algorithm's job, see section 4.3), we seek to
minimize it, i.e., our methodology favors faster response time
over energy savings.

The perceptible interactive episode-length increase is computed
for all scaled episodes that fall above the perception threshold
by dividing the scaled episode length by either the full-speed
episode length or the perception threshold, depending on whether
the original episode length was longer or shorter (respectively)
than the perception threshold.
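
This metric can be written as a small function. The sketch below
follows the description above; the function name is illustrative,
times are in milliseconds, and returning None for episodes that
stay under the perception threshold is a convention chosen here to
mark them as excluded from the statistic.

```python
def perceptible_increase(scaled_ms, full_speed_ms, perception_threshold=50.0):
    """Relative length increase of a scaled episode, or None if imperceptible."""
    if scaled_ms <= perception_threshold:
        return None  # episode still finishes under the perception threshold
    # Baseline: full-speed length if it was already perceptible, else the threshold.
    baseline = max(full_speed_ms, perception_threshold)
    return scaled_ms / baseline - 1.0
```

For example, an episode of 40 ms at full speed that is stretched
to 60 ms counts as a 20% increase (60 ms against the 50 ms
threshold), not a 50% increase.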

Table 2 shows our baseline results using the XSB model (without
quantization), a 50 ms perception threshold, and the assumptions
described in section 6.1. The mean perceptible interactive episode
length increase in all cases is under 30%. Applications that have
many short episodes (e.g., Xemacs and Netscape) tend to have the
largest increase, while workloads with long episodes (e.g.,
Ghostview and GIMP) exhibit the smallest increase. This makes
sense given that our acceptable delay function (PanicThreshold)
allows more performance degradation for shorter episodes than for
longer ones. One should also note that the number of performance
transitions increases significantly (up to four times) when MP3
playback is running in the background. This is because, when a
periodic episode is running, the performance-setting algorithm
alternates between the setting for interactive episodes and the
setting for the periodic task. The energy factor tends to be lower
when MP3 is running than without. The reason for this is that in
most cases the MP3 player requires a lower performance setting
than the interactive application. Xemacs is an exception: the
benchmark's interactive episodes require a lower performance
setting than the MP3 player.

Table 3 illustrates the effects of quantization on the results.
This model corresponds more closely to an actual hardware
implementation than the one used in the previous table.
Quantization tends to slightly reduce energy savings and also
reduce perceptible interactive episode lengths. The only benchmark
where this was not the case is Xemacs, where quantization corrects
a few mispredictions, causing both an increase in energy savings
and a decrease in the average perceptible episode length.

In Tables 2-4, Trans. = performance transitions; Mean and Median =
mean and median perceptible IE (interactive episode) length
increase; Energy = energy factor.

Table 2
Performance characteristics (XSB, 50 ms perception threshold).

Benchmark        No MP3 in background              MP3 playback in background
                 Trans.  Mean   Median  Energy     Trans.  Mean   Median  Energy
Acrobat Reader   543     13%    7%      0.91       668     13%    11%     0.84
FrameMaker       155     20%    11%     0.89       191     9%     7%      0.75
Ghostview        510     5%     1%      0.98       1149    5%     1%      0.91
GIMP             919     3%     4%      0.97       1731    5%     4%      0.91
Netscape         3026    18%    14%     0.87       1739    21%    12%     0.82
Xemacs           381     23%    20%     0.30       1417    29%    33%     0.34

Table 3
Performance characteristics (XSBq5, 50 ms perception threshold).

Benchmark        No MP3 in background              MP3 playback in background
                 Trans.  Mean   Median  Energy     Trans.  Mean   Median  Energy
Acrobat Reader   28      12%    5%      0.92       664     12%    10%     0.86
FrameMaker       15      37%    11%     0.90       184     7%     6%      0.77
Ghostview        10      4%     0%      0.99       1135    4%     0%      0.92
GIMP             28      4%     3%      0.98       1533    5%     3%      0.92
Netscape         32      17%    12%     0.88       1547    20%    11%     0.84
Xemacs           15      15%    14%     0.26       1416    29%    31%     0.32

Table 4
Performance characteristics with MP3 playback in background
(XSBq5, 50 ms perception threshold).

Benchmark        MP3, IE perf. transition          MP3, IE perf. transition
                 start latency = 1 ms              start latency = 5 ms
                 Trans.  Mean   Median  Energy     Trans.  Mean   Median  Energy
Acrobat Reader   637     11%    4%      0.86       125     13%    6%      0.75
FrameMaker       153     8%     7%      0.76       81      12%    9%      0.73
Ghostview        1031    5%     1%      0.91       222     6%     1%      0.86
GIMP             854     5%     4%      0.88       334     6%     6%      0.83
Netscape         1072    18%    12%     0.83       340     20%    14%     0.72
Xemacs           1047    29%    32%     0.32       980     34%    39%     0.31

Perhaps the most striking improvement over the data in table 2 is
the dramatic reduction in the number of performance-level
transitions (in some cases more than 300-fold) when no MP3
playback is running concurrently with the interactive application.
This behaviour is also due to the fact that when there is no
periodic background activity, the successively predicted
performance levels are close to each other, causing quantization
to eliminate the minor corrections. When MP3 playback is present,
the deliberate transitions between interactive and periodic modes
keep the number of transitions high.

The number of performance transitions can be further reduced based
on an observation of figure 4. We have pointed out that the
majority of interactive episodes are very short (less than 1 ms)
and that very little time is spent in those episodes (<5%). When
there is no periodic background activity, the effect of the short
interactive episodes is negligible since the performance level is
always set at the predicted interactive performance level.
However, when there is an MP3 player in the background, these
short interactive episodes cause an unnecessary transition from
the periodic to the interactive performance level. When the
interactive episode is very short, this transition simply wastes
energy, since the episode is likely to be finished before the
transition is over.

This observation suggests a strategy that waits for a cer- 
tain amount of time before starting a transition to the inter- 
active performance level. Table 4 illustrates the effects of a 
transition-start latency before interactive episodes. The data 
is only shown for the benchmarks with MP3 playback in the 
background, because there was no significant change in re- 
sults when background activity was not present. 
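
The effect of a transition-start latency can be illustrated with a
toy count over a list of episode lengths. This is a simplified
sketch, not the simulator used for Table 4: it only counts switches
to the interactive level, ignores the switch back to the periodic
level, and uses made-up episode lengths in milliseconds.

```python
def count_interactive_transitions(episode_lengths_ms, start_latency_ms):
    """Count episodes that outlast the latency and so trigger a transition."""
    return sum(1 for length in episode_lengths_ms if length > start_latency_ms)


# Hypothetical episode lengths: mostly very short, a few long ones.
episodes = [0.3, 0.8, 2.0, 4.0, 30.0, 120.0]
print(count_interactive_transitions(episodes, 0.0))  # 6
print(count_interactive_transitions(episodes, 1.0))  # 4
print(count_interactive_transitions(episodes, 5.0))  # 2
```

Because most interactive episodes are very short, even a small
latency filters out a large share of the transitions while the
long episodes, which dominate execution time, are still served at
the interactive level.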

A comparison with table 3 shows that a 1 ms transition-start
latency leaves the energy factors mostly unchanged but causes a
small reduction in the number of performance transitions.
Extending the transition-start latency to 5 ms causes a
significant reduction both in energy consumption and in the number
of performance transitions (up to a 5-fold reduction). Moreover,
the average perceptible interactive episode lengths stay at around
the same level as before.

[Figure 6: two panels, "No MP3" and "MP3 playback in background".]

Figure 6. Energy factors corresponding to different perception
thresholds using three quantized (5%) models. These graphs show
results using the Enhanced predictor corresponding to a variety of
perception thresholds. At each perception threshold level, we show
the energy factors for the XSBaseQ5 (top point), XSAq5 (middle
point, connected) and XSBq5 (bottom point) models.

6.3. The Enhanced predictor

The previous section suggests two minor changes to the Basic 
predictor: (1) to quantize the allowable performance levels; 
and (2) to wait for a certain time before changing the perfor- 
mance factor when an interactive episode starts. For simplic- 
ity we only use a statically specified transition start latency of 
5 ms in the Enhanced predictor. A more sophisticated predic- 
tor could dynamically compute a per-process value. 

Figure 6 shows the energy factors using the Enhanced predictor and
the XSBaseQ5, XSAq5 and XSBq5 power models given a variety of
perception thresholds. Our results show that while on the
measurement machine there is little opportunity for power savings,
as the peak performance of the processor gets faster, energy
savings will be more pronounced. We must note that our traces were
collected on a 450 MHz Pentium II machine, and today's high-end
processors are already 2-3 times faster. We estimate that the
energy factor on today's high-end desktops could be in the 10-75%
range at the 50-100 ms perception threshold.

In the figure, energy factors corresponding to the XSAq5 model are
connected by a line at each perception threshold; the results for
the XSBaseQ5 and XSBq5 models are shown as the bars above and
below each point (respectively). In all cases,
XSBq5 achieves the largest energy savings, while XSBaseQ5 
achieves the least. For most applications the difference be- 
tween the three models is small: in most cases less than 5%. 
However, the difference is significant on the Xemacs bench- 
mark runs, since this application spends most of its time at 
the lowest performance setting allowed in each model. In this 
case, the lower minimum performance levels of the XSA and 
XSB models give them a significant edge. 

Periodic activity has the effect of vertically compressing the
graph towards its center. The load due to the background activity
increases the energy savings for benchmarks with long interactive
episodes. On the other hand, one can observe the exact opposite
effect when the interactive episodes are short (Xemacs).

6,4. Desired hardware improvements 

We have already shown that by reducing the number of 
performance-level changes, quantization can have a positive 
effect on both energy savings and Interactive perfcrrmance. 
In our measurements we found that the quantum size does 
not greatly impact our results. Using the Enhanced predictor,
the differences in energy savings among quantum sizes of 5%,
10%, and 20% are negligible. While the 5% quantum size has the
smallest energy factor in all cases, the 20% quantum size is
behind by at most 1%. Using Oracle information, the
difference in energy savings due to different quantum sizes 
would always be under 5%. 
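
To make the effect of the quantum size concrete, the following is a minimal sketch of quantization, not the implementation evaluated here. It assumes that a requested performance level (in percent of peak) is rounded up to the next multiple of the quantum; the rounding direction, the function names, and the sample sequence of requested levels are all illustrative assumptions.

```python
import math

def quantize(level_pct, quantum_pct, min_pct=0, max_pct=100):
    """Round a requested performance level (percent of peak) up to the
    next multiple of quantum_pct, clamped to [min_pct, max_pct].
    Rounding up is an assumption made for this sketch."""
    if quantum_pct <= 0:
        raise ValueError("quantum_pct must be positive")
    steps = math.ceil(level_pct / quantum_pct)
    return min(max_pct, max(min_pct, steps * quantum_pct))

def count_changes(levels):
    """Number of performance-level changes in a sequence of settings."""
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)

# Hypothetical sequence of requested levels (percent of peak performance).
requested = [42, 44, 47, 51, 49, 46, 58, 61, 59, 43]

for quantum in (5, 10, 20):
    quantized = [quantize(x, quantum) for x in requested]
    print(quantum, quantized, count_changes(quantized))
```

On this made-up sequence, a coarser quantum merges nearby requests into the same setting and so triggers fewer performance-level changes, at the cost of running somewhat faster than requested.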

For all our measurements we assume that there is a 20 μs
pause when a performance transition is initiated, and that a 
transition takes 1 ms. The current pause duration, which we 
believe is indicative of what can realistically be expected, is 
short enough that eliminating it has no significant impact
on either energy savings or perceptible interactive
performance. However, there might be other reasons for lowering
it, such as latencies incurred during communication with pe- 
ripheral devices. 
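
A rough back-of-the-envelope calculation may help put the pause in perspective. The sketch below uses the 20 μs pause and 1 ms transition assumed above; the rate of 100 transitions per second is a hypothetical figure chosen only for illustration, and the model assumes that no useful work is done during the pause.

```python
PAUSE_S = 20e-6       # pause when a transition is initiated (from the text)
TRANSITION_S = 1e-3   # duration of a performance transition (from the text)

def pause_fraction(transitions_per_second, pause_s=PAUSE_S):
    """Fraction of wall-clock time spent in transition pauses, assuming
    the processor does no useful work while paused."""
    return transitions_per_second * pause_s

# The pause relative to the length of a single transition.
print(PAUSE_S / TRANSITION_S)

# Hypothetical rate: 100 performance-level changes per second.
print(pause_fraction(100))
```

Under these assumptions the pause is about 2% of the length of a transition, and even at 100 transitions per second it accounts for roughly 0.2% of the elapsed time.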

Reducing the lengths of performance transitions, on the
other hand, has a positive effect on both perceptible perfor- 
mance and energy savings. While shortening performance 
transitions has an overall positive effect, a better prediction 
mechanism (as shown by the Oracle predictor) could achieve 
even more significant improvements. We believe that the En- 
hanced predictor could be improved by giving it the ability 
to distinguish between episode instances (as discussed in sec- 
tion 4.1), not just episode types. 

7. Conclusions and future work

In this paper we describe an automatic episode detection
mechanism that can be used to guide the performance-setting 
decisions for a processor that supports dynamic performance 
and voltage scaling. This system can derive and predict 
episode deadlines automatically, without the need to modify 
existing user programs. We have shown that our approach 
can achieve significant energy savings while ensuring that
interactive performance stays at an acceptable level. We are
currently working on the evaluation of our algorithms on a
system that is capable of dynamic voltage scaling.

While our current implementation is tied closely to the 
Linux kernel and its application environment, we believe that 
the ideas proposed in this paper are also applicable to other 
operating systems. We developed our methodology for Linux 
by observing common program design and communication 
patterns. While the specifics may vary from one OS to an- 
other, most modern operating systems have abstractions that 
a similar monitoring environment could be built on (i.e.,
interprocess communication, multithreading, system calls).

Our mechanism works without modifications of user programs;
however, an optional API might be useful for applications
that want to take full advantage of performance scaling. One
of the biggest shortcomings of the current predictor is its
inability to distinguish episode instances from one another.
An API that would allow the programmer to delineate and name
critical episodes, and optionally specify their type and
deadline, would help. The API might consist of the following
system calls:

episode_begin <id> [type] [deadline]
episode_end <id> 

The id is a per-task identifier assigned by the programmer to
distinguish one episode from another. The type field
optionally categorizes the episode. The deadline field
optionally specifies the maximum length of the episode. The
main role of this API is to give hints to the existing
prediction and communication-tracking mechanism, rather than
to supersede it. For example, there is no need to specify
dependencies between episodes, since that information can be
derived automatically from information in the kernel.
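
The semantics of the two proposed calls can be illustrated with a small user-space mock. This is only a sketch under stated assumptions: no such system calls exist in the kernel, and apart from the names episode_begin and episode_end, everything here (the EpisodeHints class, the explicit task argument, the injectable clock, and the return value of episode_end) is hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

@dataclass
class Episode:
    start: float                 # time at which the episode began
    episode_type: Optional[str]  # optional category hint
    deadline: Optional[float]    # optional maximum length, in seconds

class EpisodeHints:
    """User-space mock of the proposed hint interface (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # Identifiers are per task, so open episodes are keyed by (task, id).
        self._open: Dict[Tuple[int, int], Episode] = {}

    def episode_begin(self, task, episode_id, episode_type=None, deadline=None):
        key = (task, episode_id)
        if key in self._open:
            raise ValueError("episode already open for this task")
        self._open[key] = Episode(self._clock(), episode_type, deadline)

    def episode_end(self, task, episode_id):
        """Close the episode; return (length, deadline_met)."""
        episode = self._open.pop((task, episode_id))
        length = self._clock() - episode.start
        met = episode.deadline is None or length <= episode.deadline
        return length, met

# Usage with a fake clock: a 30 ms episode against a 50 ms deadline.
ticks = iter([0.0, 0.030])
hints = EpisodeHints(clock=lambda: next(ticks))
hints.episode_begin(task=1, episode_id=7, episode_type="interactive", deadline=0.050)
length, met = hints.episode_end(task=1, episode_id=7)
print(length, met)
```

Note that the mock records no dependencies between episodes, in keeping with the point above that such information can be derived by the kernel rather than supplied by the programmer.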

We have shown that along with peak performance, it is 
also important to allow the processor to run slowly. While 
there will always be applications that can only run acceptably 
at the processor's fastest setting, an increasing number of ap- 
plications are able to take advantage of the lower performance 
modes of the processor. As peak performance of the proces- 
sor increases, it is important to widen the gap between the 
minimum and maximum performance levels of the processor. 

We believe that the core idea of our technique - on-line 
monitoring and dynamic adaptation - could be extended to 
allow the kernel to make better scheduling and service-quality 
decisions. 